Covid Data Trends and Statistics

By: Jacob Siegel

Abstract

There is a wide range in the death rate attributed to the coronavirus among US states in the time frame from the onset of the pandemic to August 2021. The range runs from 0.038% of the population in Hawaii to 0.3% of the population in New York, nearly an order of magnitude difference. To understand this range in outcomes, a state-level dataset was assembled with various attributes pertaining to each state, such as average obesity, average GDP, education level, race, population density, political party of the governor, etc. This dataset was then used as the input for sklearn's RidgeCV, a ridge regression model with built-in cross-validation. The analysis shows that many factors contributed to this range in death rate, with the most salient features being the population density, obesity rate, and education level of the state. The political party of the state's leadership (governor as of 2020) does not appear to have a strong influence on the death rate. However, news and social media have fomented an intense belief that one political party is better at managing the coronavirus than the other, creating a strong divide and leaving both sides attributing failure to the other.

1.0 Data Overview

The data is a state-level overview of statistics obtained from a variety of sources, including the CDC, the Bureau of Labor Statistics, KFF, and Wikipedia. The columns and sources are as follows:

Columns:

Sources:

A second dataset covers deaths recorded during one week in August, to highlight recent changes and the increase in vaccinations.

1.1 Feature Engineering

2.0 EDA: Univariate Analysis With a Focus on Political Party

Observation:

Observation:

Observations:

2.1 Distribution of Variables

Calculate which variables show the biggest difference (by t-statistic) and display them.
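A minimal sketch of this ranking step, assuming the comparison is between the two governor-party groups and that Welch's t-statistic (unequal variances) is an acceptable stand-in for whatever test the notebook used. The helper names, column names, and values are hypothetical placeholders, not the real state data.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two independent samples (unequal variances)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def rank_by_t(rows, group_key, columns):
    """Return (column, t) pairs sorted by |t|, largest first."""
    groups = sorted({r[group_key] for r in rows})
    if len(groups) != 2:
        raise ValueError("expected exactly two groups")
    g0, g1 = groups
    scores = []
    for col in columns:
        a = [r[col] for r in rows if r[group_key] == g0]
        b = [r[col] for r in rows if r[group_key] == g1]
        scores.append((col, welch_t(a, b)))
    return sorted(scores, key=lambda s: abs(s[1]), reverse=True)


# Placeholder rows for illustration only; not the real dataset.
rows = [
    {"party": "D", "density": 10.0, "obesity": 30.0},
    {"party": "D", "density": 12.0, "obesity": 31.0},
    {"party": "D", "density": 14.0, "obesity": 29.0},
    {"party": "R", "density": 4.0, "obesity": 30.5},
    {"party": "R", "density": 5.0, "obesity": 29.5},
    {"party": "R", "density": 6.0, "obesity": 31.5},
]
ranking = rank_by_t(rows, "party", ["density", "obesity"])
for column, t in ranking:
    print(f"{column}: t = {t:.2f}")
```

Sorting by the absolute value of t puts the variables that separate the two groups most clearly at the top, regardless of the direction of the difference.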

3.0 EDA: Multivariate Analysis

There is an outlier in the Asian column.

Remove Hawaii as an outlier for Asian population.
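One way to confirm and drop such an outlier is sketched below, assuming a simple upper Tukey fence (Q3 + 1.5 × IQR) is a reasonable criterion; the source does not say how the outlier was identified. The function name and all percentages are made-up placeholders.

```python
from statistics import quantiles


def upper_outliers(values_by_state, k=1.5):
    """Return states whose value lies above Q3 + k * IQR."""
    q1, _, q3 = quantiles(values_by_state.values(), n=4)
    fence = q3 + k * (q3 - q1)
    return [state for state, v in values_by_state.items() if v > fence]


# Placeholder percentages for illustration only; not the real dataset.
asian_pct = {
    "State A": 2.0,
    "State B": 3.0,
    "State C": 4.0,
    "State D": 5.0,
    "State E": 6.0,
    "Hawaii": 38.0,
}
outliers = upper_outliers(asian_pct)
# Keep every state that was not flagged.
kept = {s: v for s, v in asian_pct.items() if s not in outliers}
print(outliers, len(kept))
```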

4.0 Linear Regression Modeling

4.1 Create Dependent (Y) and Independent (X) Variables

Create the dependent (y) and independent (X) variables. X is then scaled per column.
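Per-column scaling can be sketched as follows, assuming "scaled" means standardization to zero mean and unit variance; the source does not name the scaler that was used. The function name and input matrix are hypothetical.

```python
import numpy as np


def standardize_columns(X):
    """Scale each column to zero mean and unit (population) variance."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # leave constant columns unscaled instead of dividing by 0
    return (X - mu) / sigma


# Placeholder feature matrix: two columns on very different scales.
X_raw = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
X_scaled = standardize_columns(X_raw)
print(X_scaled)
```

Putting the columns on a common scale matters for ridge regression, because the penalty acts on coefficient size and would otherwise treat features differently depending on their units.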

Check multicollinearity using VIF scores.

College and Life_Expectancy have the highest VIF scores and will be removed. Several attributes still have scores that are considered high (above 10).
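For reference, the VIF of a feature is 1 / (1 − R²), where R² comes from regressing that feature on all the others. The sketch below computes it from scratch with NumPy on synthetic data; it is not the notebook's own code, and a library routine may differ in details such as how the intercept is handled. The function name `vif_scores` is hypothetical.

```python
import numpy as np


def vif_scores(X, names):
    """VIF per column: regress it on the other columns (plus an intercept)."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    scores = {}
    for j, name in enumerate(names):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((target - target.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot
        scores[name] = float("inf") if r2 >= 1.0 else 1.0 / (1.0 - r2)
    return scores


# Synthetic data: "c" is nearly a copy of "a", while "b" is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
c = a + 0.05 * rng.normal(size=200)
scores = vif_scores(np.column_stack([a, b, c]), ["a", "b", "c"])
for name, value in scores.items():
    print(f"{name}: VIF = {value:.1f}")
```

Dropping one of a pair of nearly collinear columns and recomputing the scores shows the remaining column's VIF fall back toward 1, which is the reasoning behind removing features one at a time.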

Drop Med_Age

Rerun the model with the percentage of deaths in the last 7 days as the Y variable.
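A hedged sketch of fitting the same RidgeCV model against two different targets is shown below. The feature matrix and both targets are synthetic, and the helper `fit_ridge`, the feature names, and the alpha grid are assumptions for illustration rather than the notebook's settings.

```python
import numpy as np
from sklearn.linear_model import RidgeCV


def fit_ridge(X, y, names, alphas=(0.1, 1.0, 10.0)):
    """Fit RidgeCV and return the model plus coefficients ranked by magnitude."""
    model = RidgeCV(alphas=alphas).fit(X, y)
    ranked = sorted(zip(names, model.coef_), key=lambda p: abs(p[1]), reverse=True)
    return model, ranked


# Synthetic stand-ins: 50 rows (one per state) and 3 already-scaled features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
names = ["f0", "f1", "f2"]
y_total = 2.0 * X[:, 0] + 0.1 * rng.normal(size=50)   # stand-in for cumulative death rate
y_week = -1.5 * X[:, 2] + 0.1 * rng.normal(size=50)   # stand-in for last-7-day death rate

model_total, ranked_total = fit_ridge(X, y_total, names)
model_week, ranked_week = fit_ridge(X, y_week, names)
print("total:", ranked_total[0][0], "alpha =", model_total.alpha_)
print("week: ", ranked_week[0][0], "alpha =", model_week.alpha_)
```

Because X is unchanged and only y is swapped, the two coefficient rankings can be compared directly to see whether the salient features shift for the more recent period.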

5.0 Logistic Regression Model

The classes are fairly evenly split between states that are above (1) and below (0) the national average.
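The binary target and its class balance could be produced along these lines. This is a sketch only: the rates and the national average are placeholders, and how the notebook computed the national average is not stated, so it is passed in as a parameter.

```python
from collections import Counter


def label_above_average(rates, national_average):
    """Map each state to 1 if its rate exceeds the national average, else 0."""
    return {state: int(rate > national_average) for state, rate in rates.items()}


# Placeholder death rates (% of population); not the real dataset.
rates = {"State A": 0.05, "State B": 0.12, "State C": 0.20, "State D": 0.25}
national_average = 0.15  # placeholder value

labels = label_above_average(rates, national_average)
balance = Counter(labels.values())
print(balance)
```

Checking the class counts before fitting a classifier is worthwhile because a heavily skewed split would make plain accuracy a misleading metric.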